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Abstract 

http:// Exploiting a priori known structural information lies at the core of many image reconstruction methods 
that can be stated as inverse problems. The synthesis model, which assumes that images can be decomposed into 
a linear combination of very few atoms of some dictionary, is now a well established tool for the design of image 
reconstruction algorithms. An interesting alternative is the analysis model, where the signal is multiplied by an 
analysis operator and the outcome is assumed to be sparse. This approach has only recently gained increasing 
interest. The quality of reconstruction methods based on an analysis model severely depends on the right choice of 
the suitable operator. 

In this work, we present an algorithm for learning an analysis operator from training images. Our method is 
based on £ p -novm minimization on the set of full rank matrices with normalized columns. We carefully introduce 
the employed conjugate gradient method on manifolds, and explain the underlying geometry of the constraints. 
Moreover, we compare our approach to state-of-the-art methods for image denoising, inpainting, and single image 
super-resolution. Our numerical results show competitive performance of our general approach in all presented 
applications compared to the specialized state-of-the-art techniques. 

Index Terms 

Analysis Operator Learning, Inverse Problems, Image Reconstruction, Geometric Conjugate Gradient, Oblique 
Manifold 



I. INTRODUCTION 

A. Problem Description 

LINEAR inverse problems are ubiquitous in the field of image processing. Prominent examples are image 
denoising (TJ, inpainting (2), super-resolution (3), or image reconstruction from few indirect measurements as 
in Compressive Sensing [4|. Basically, in all these problems the goal is to reconstruct an unknown image sGP 
as accurately as possible from a set of indirect and maybe corrupted measurements y E M m with n > m, see (5| 
for a detailed introduction to inverse problems. Formally, this measurement process can be written as 

y = *4s + e, (1) 

where the vector e E M m models sampling errors and noise, and A E R mXn is the measurement matrix modeling 
the sampling process. In many cases, reconstructing s by simply inverting Equation ([I]) is ill-posed because either 
the exact measurement process and hence A is unknown as in blind image deconvolution, or the number of 
observations is much smaller compared to the dimension of the signal, which is the case in Compressive Sensing 
or image inpainting. To overcome the ill-posedness and to stabilize the solution, prior knowledge or assumptions 
about the general statistics of images can be exploited. 
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B. Synthesis Model and Dictionary Learning 

One assumption that has proven to be successful in image reconstruction, cf. [6|, is that natural images admit a 
sparse representation x E IR d over some dictionary V E W ixd with d > n. A vector x is called sparse when most 
of its coefficients are equal to zero or small in magnitude. When s admits a sparse representation over V, it can be 
expressed as a linear combination of only very few columns of the dictionary {di}f =1 , called atoms, which reads 
as 

s = Px. (2) 

For d > n, the dictionary is said to be overcomplete or redundant. 

Now, using the knowledge that ([2]) allows a sparse solution, an estimation of the original signal in ([!]) can be 
obtained from the measurements y by first solving 

x* = arg min g(x) subject to ||*4Dx — y ||| < e, (3) 

and afterwards synthesizing the signal from the computed sparse coefficients via s* = Dx*. Therein, g : R d — » R 
is a function that promotes or measures sparsity, and e E M + is an estimated upper bound on the noise power 1 1 e 1 1 § - 
Common choices for g include the ^,-norm 

l|v||£:=5>i| p , (4) 

i 

with < p < 1 and differentiable approximations of Q. As the signal is synthesized from the sparse coefficients, 
the reconstruction model ([3]) is called the synthesis reconstruction model (7J. 

To find the minimizer of Problem ([3]), various algorithms based on convex or non-convex optimization, greedy 
pursuit methods, or Bayesian frameworks exist that may employ different choices of g. For a broad overview 
of such algorithms, we refer the interested reader to [8|. What all these algorithms have in common, is that their 
performance regarding the reconstruction quality severely depends on an appropriately chosen dictionary V. Ideally, 
one is seeking for a dictionary where s can be represented most accurately with a coefficient vector x that is as 
sparse as possible. Basically, dictionaries can be assigned to two major classes: analytic dictionaries and learned 
dictionaries. 

Analytic dictionaries are built on mathematical models of a general type of signal, e.g. natural images, they 
should represent. Popular examples include Wavelets [9|, Bandlets (10), and Curvlets fTT) among several others, 
or a concatenation of various such bases/dictionaries. They offer the advantages of low computational complexity 
and of being universally applicable to a wide set of signals. However, this universality comes at the cost of not 
giving the optimally sparse representation for more specific classes of signals, e.g. face images. 

It is now well known that signals belonging to a specific class can be represented with fewer coefficients over a 
dictionary that has been learned using a representative training set, than over analytic dictionaries. This is desirable 
for various image reconstruction applications as it readily improves their performance and accuracy fT2)-[[T4). 
Basically, the goal is to find a dictionary over which a training set admits a maximally sparse representation. In 
contrast to analytic dictionaries, which can be applied globally to an entire image, learned dictionaries are small 
dense matrices that have to be applied locally to small image patches. Hence, the training set consists of small 
patches extracted from some example images. This restriction to patches mainly arises from limited memory, and 
limited computational resources. 

Roughly speaking, starting from some initial dictionary the learning algorithms iteratively update the atoms of 
the dictionary, such that the sparsity of the training set is increased. This procedure is often performed via block- 
coordinate relaxation, which alternates between finding the sparsest representation of the training set while fixing 
the atoms, and optimizing the atoms that most accurately reproduce the training set using the previously determined 
sparse representation. Three conceptually different approaches for learning a dictionary became well established, 
which are probabilistic ones like (15), clustering based ones such as K-SVD fT6) , and recent approaches which aim 
at learning dictionaries with specific matrix structures that allow fast computations like (IT). For a comprehensive 



overview of dictionary learning techniques see |18|. 
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C Analysis Model 

An alternative to the synthesis model ([3]) for reconstructing a signal, is to solve 

s* = arg min g(fts) subject to ||*4s — y||| < e, (5) 

which is known as the analysis model |[7|. Therein, ft G M. kxn with k > n is called the analysis operator, and the 
analyzed vector fis E M fc is assumed to be sparse, where sparsity is again measured via an appropriate function g. 
In contrast to the synthesis model, where a signal is fully described by the non-zero elements of x, in the analysis 
model the zero elements of the analyzed vector fis described the subspace containing the signal. To emphasize this 
difference, the term cosparsity has been introduced in |T9| , which simply counts the number of zero elements of 
fis. As the sparsity in the synthesis model depends on the chosen dictionary, the cosparsity of an analyzed signal 
depends on the choice of the analysis operator f2. 

Different analysis operators proposed in the literature include the fused Lasso (20} , the translation invariant 
wavelet transform pT| , and probably best known the finite difference operator which is closely related to the 
total- variation |22|. They all have shown very good performance when used within the analysis model for solving 
diverse inverse problems in imaging. The question is: Can the performance of analysis based signal reconstruction 
be improved when a learned analysis operator is applied instead of a predefined one, as it is the case for the 
synthesis model where learned dictionaries outperform analytic dictionaries? In (7), it has been discussed that the 
two models differ significantly, and the naive way of learning a dictionary and simply employing its transposed 
or its pseudo-inverse as the learned analysis operator fails. Hence, different algorithms are required for analysis 
operator learning. 



D. Contributions 

In this work, we introduce a new algorithm based on geometric optimization for learning a patch based analysis 
operator from a set of training samples, which we name GOAL (GeOmetric Analysis operator Learning). The 



method relies on a minimization problem, which is carefully motivated in Section |II-B| Therein, we also discuss 
the question of what is a suitable analysis operator for image reconstruction, and how to antagonize overfitting the 
operator to a subset of the training samples. An efficient geometric conjugate gradient method on the so-called 
oblique manifold is proposed in Section [III] for learning the analysis operator. Furthermore, in Section 
how to apply the local patch based analysis operator to achieve global reconstruction results. Section 



IV we explain 
V] sheds some 



light on the influence of the parameters required by GOAL and how to select them, and compares our method to 
other analysis operator learning techniques. The quality of the operator learned by GOAL on natural image patches 
is further investigated in terms of image denoising, inpainting, and single image super-resolution. The numerical 
results show the broad and effective applicability of our general approach. 



E. Notations 

Matrices are written as capital calligraphic letters like X, column vectors are denoted by boldfaced small letters 
e.g. x, whereas scalars are either capital or small letters like n, N. By v\ we denote the i th element of the vector 
v, Vij denotes the i th element in the j th column of a matrix V. The vector v : ^ denotes the i th column of V whereas 
v^ : denotes the transposed of the i th row of V. By 6ij 9 we denote a matrix whose i th entry in the j th column is 
equal to one, and all others are zero. denotes the identity matrix of dimension (kxk), denotes the zero-matrix 
of appropriate dimension, and ddiag(V) is the diagonal matrix whose entries on the diagonal are those of V. By 
II^IIf = v ij we denote the squared Frobenius norm of a matrix V, tr(V) is the trace of V, and rk(V) denotes 
the rank. 



II. Analysis Operator Learning 

A. Prior Art 

The topic of analysis operator learning has only recently started to be investigated, and only few prior work 
exists. In the sequel, we shortly review analysis operator learning methods that are applicable for image processing 
tasks. 
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Given a set of M training samples {s^ G R 71 }^, the goal of analysis operator learning is to find a matrix 
f2 g M. kxn with k > n, which leads to a maximally cosparse representation ftsi of each training sample. As 



mentioned in Subsection |I-B[ the training samples are distinctive vectorized image patches extracted from a set of 
example images. Let S = [si, . . . ,sm] G R nxM be a matrix where the training samples constitute its columns, 
then the problem is to find 

= arg min G(flS), (6) 

where fi is subject to some constraints, and G is some function that measures the sparsity of the matrix fi<S. In |23) , 
an algorithm is proposed in which the rows of the analysis operator are found sequentially by identifying directions 
that are orthogonal to a subset of the training samples. Starting from a randomly initialized vector u G R n , a 
candidate row is found by first computing the inner product of u> with the entire training set, followed by extracting 
the reduced training set Sr of samples whose inner product with u? is smaller than a threshold. Thereafter, u is 
updated to be the eigenvector corresponding to the smallest eigenvalue of SrS^. This procedure is iterated several 
times until a convergence criterion is met. If the determined candidate vector is sufficiently distinctive from already 
found ones, it is added to as a new row, otherwise it is discarded. This process is repeated until the desired 
number k of rows have been found. 

An adaption of the widely known K-SVD dictionary learning algorithm to the problem of analysis operator 
learning is presented in p4| . As in the original K-SVD algorithm, G(ftS) = J2i ll^ s i||o is employed as the 
sparsifying function and the target cosparsity is required as an input to the algorithm. The arising optimization 
problem is solved by alternating between a sparse coding stage over each training sample while fixing f2 using an 
ordinary analysis pursuit method, and updating the analysis operator using the optimized training set. Then, each 



row of f2 is updated in a similar way as described in the previous paragraph for the method of (23). Interestingly, 
the operator learned on piecewise constant image patches by (23) and [24) closely mimics the finite difference 
operator. 

In (25) , the authors use G(QS) = J2i ll^ s i||i as th e sparsity promoting function and suggest a constrained 
optimization technique that utilizes a projected subgradient method for iteratively solving ([6]). To exclude the trivial 
solution, the set of possible analysis operators is restricted to the set of Uniform Normalized Tight Frames, i.e. 
matrices with uniform row norm and orthonormal columns. The authors state that this algorithm has the limitation 
of requiring noiseless training samples whose analyzed vectors {fis^}^ are exactly cosparse. 

To overcome this restriction, the same authors propose an extension of this algorithm that simultaneously learns 
the analysis operator and denoises the training samples, cf. [26) . This is achieved by alternating between updating 
the analysis operator via the projected subgradient algorithm and denoising the samples using an Augmented 
Lagrangian method. Therein, the authors state that their results for image denoising using the learned operator are 
only slightly worse compared to employing the commonly used finite difference operator. 

An interesting idea related to the analysis model, called Fields-of-Experts (FoE) has been proposed in [ [27) . 
The method relies on learning high-order Markov Random Field image priors with potential functions extending 
over large pixel neighborhoods, i.e. overlapping image patches. Motivated by a probabilistic model, they use the 
student-t distribution of several linear filter responses as the potential function, where the filters, which correspond 
to atoms from an analysis operator point of view, have been learned from training patches. Compared to our work 
and the methods explained above, their learned operator used in the experiments is underdetermined, i.e. k < n, the 
algorithms only works for small patches due to computational reasons, and the atoms are learned independently, 
while in contrast GOAL updates the analysis operator as a whole. 

B. Motivation of Our Approach 

In the quest for designing an analysis operator learning algorithm, the natural question arises: What is a good 
analysis operator for our needs? Clearly, given a signal s that belongs to a certain signal class, the aim is to find an 
fi such that fis is as sparse as possible. This motivates to minimize the expected sparsity E[#(f2s)]. All approaches 



presented in Subsection |II-A| can be explained in this way, i.e. for their sparsity measure g they aim at learning 
an ft that minimizes the empirical mean of the sparsity over all randomly drawn training samples. This, however, 
does not necessarily mean to learn the optimal f2 if the purpose is to reconstruct several signals belonging to a 
diverse class, e.g. natural image patches. The reason for this is that even if the expected sparsity is low, it may 
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Fig. 1. Illustration of two possible distributions Pr(g(Qis) < x) for two analysis operators. Q±: low expectation g 1 , high variance (dashed 
line); f^: moderate expectation g 2 , moderate variance. Although fli yields a smaller expectation, there are more signals compared to Q.2 
where the sparsity model fails, i.e. Pr(g(ftis) > u) > Pr(g(Q2&) > u) for a suitable upper bound u. 



happen with high probability that some realizations of this signal class cannot be represented in a sparse way, i.e. 
that for a given upper bound u, the probability Pr(g(fts) > u) exceeds a tolerable value, cf. Figure [T] 

The algorithm presented here aims at minimizing the empirical expectation of a sparsifying function g(ftsi) for 
all training samples s^, while additionally keeping the empirical variance moderate. In other words, we try to avoid 
that the analyzed vectors of many similar training samples become very sparse and consequently prevent ft from 
being adapted to the remaining ones. For image processing, this is of particular interest if the training patches 
are chosen randomly from natural images, because there is a high probability of collecting a large subset of very 
similar patches, e.g. homogeneous regions, that bias the learning process. 

Concretely, we want to find an ft that minimizes both the squared empirical mean 

5 2 =(ifE^)) 2 (7) 

i 

and the empirical variance 

s2 = i/E(^)-^) 2 (8) 

i 

of the sparsity of the analyzed vectors. We achieve this by minimizing the sum of both, which is readily given by 

9 2 + S 2 4^#,) 2 (9) 

i 

Using g(-) = ||-||^, and introducing the factor \ the function we employ reads as 

M k M 

j=i i=\ j=i 

with < p < 1 and V = QS. 

Certainly, without additional prior assumptions on fi, the useless solution = is the global minimizer of 
Problem ([6]). To avoid the trivial solution and for other reasons explained later in this section, we regularize the 
problem by imposing the following three constraints on ft. 

(i) The rows of ft have unit Euclidean norm, i.e. ||u>i, : ||2 = 1 for i = 1, . . . , fe. 

(ii) The analysis operator ft has full rank, i.e. rk(fi) = n. 

(iii) The analysis operator ft does not have linear dependent rows, i.e. ^ ±<^j, : for i ^ j. 

The rank condition (ii) on ft is motivated by the fact that different input samples Si, S2 E W 1 with si ^ S2 should 
be mapped to different analyzed vectors fisi ^ fis2. With Condition (iii) redundant transform coefficients in an 
analyzed vector are avoided. 
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These constraints motivate the consideration of the set of full rank matrices with normalized columns, which 
admits a manifold structure known as the oblique manifold (28j 

OB(ra,fc) := {X E R nxk \rk(X) = n, ddiag(A ,T A') =l k }. (11) 

Note, that this definition only yields a non-empty set if k > n, which is the interesting case in this work. Thus, 
from now on, we assume k > n. Remember that we require the rows of fi to have unit Euclidean norm. Hence, 
we restrict the transposed of the learned analysis operator to be an element of OB(n, k). 

Since OB(n, k) is open and dense in the set of matrices with normalized columns, we need a penalty function 
that ensures the rank constraint (ii) and prevents iterates to approach the boundary of OB(n, k). 

Lemma 1: The inequality < det(^XX T ) < (^) n holds true for all X E OB(n, k), where 1 < n < k. 

Proof: Due to the full rank condition on X, the product XX T is positive definite, consequently the strict 
inequality < det(^XX T ) applies. To see the second inequality of Lemma [lj observe that 

\\X\\ 2 F = tY(XX T ) = k, (12) 

which implies tr(^XX T ) = 1. Since the trace of a matrix is equal to the sum of its eigenvalues, which are strictly 
positive in our case, it follows that the strict inequality < A^ < 1 holds true for all eigenvalues A^ of ^XX T . 
From the well known relation between the arithmetic and the geometric mean we see 

Now, since the determinant of a matrix is equal to the product of its eigenvalues, and with K = ^ T {\XX T ) = 1» 
we have 

det(lXX T ) = IIA, < (i) n , (14) 

which completes the proof. ■ 
Recalling that ft T E OB(n, k) and considering Lemma[l] we can enforce the full rank constraint with the penalty 
function 

^)==-^)logdet(^ T ")- (15) 
Regarding Condition (iii), the following result proves useful. 

Lemma 2: For a matrix X E OB(n 5 fc) with 1 < n < k, the inequality |x T ^x : j| < 1 applies, where equality 
holds true if and only if x : ^ = ±x : j. 

Proof: By the definition of OB(n, k) the columns of X are normalized, consequently Lemma [2] follows directly 
from the Cauchy-Schwarz inequality. ■ 

Thus, Condition (iii) can be enforced via the logarithmic barrier function of the scalar products between all 
distinctive rows of f2, i.e. 

r(ft):=-J2 log(l - (wjw,- :) 2 )- (16) 

l<i<j<k 

Finally, combining all the introduced constraints, our optimization problem for learning the transposed analysis 
operator reads as 

fl T = arg min J P (X T S) + k h(X T ) + (j, r(X T ). (17) 

Therein, the two weighting factors R + control the influence of the two constraints on the final solution. The 

following lemma clarifies the role of k. 

Lemma 3: Let fi be a minimum of h in the set of transposed oblique matrices, i.e. 

n T E arg mmh(X T ), (18) 

X£OB(n,k) 

then the condition number of Q is equal to one. 

Proof: It is well known that equality of the arithmetic and the geometric mean in Equation ( [T3] ) holds true, 
if and only if all eigenvalues A^ of \XX T are equal, i.e. Ai = . . . = A n . Hence, if f2 T E arg min h(X T ), then 

XeOB(n,k) 
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det(|fi T fi) = and consequently all singular values of fi coincide. This implies that the condition number 

of fi, which is defined as the quotient of the largest to the smallest singular value, is equal to one. ■ 
With other words, the minima of h are uniformly normalized tight frames, cf. |25|, [26| . From Lemma [3] we can 



conclude that with larger k the condition number of f2 approaches one. Now, recall the inequality 

CTmin||si - S2H2 < ||0(S1 - S 2 )|| 2 < CTmax||si - S 2 1 1 2 , (19) 

with cr m i n being the smallest and a max being the largest singular value of fi. From this it follows that an analysis 
operator found with a large k, i.e. obeying <r m i n « cr max , carries over distinctness of different signals to their analyzed 
versions. The parameter ji regulates the redundancy between the rows of the analysis operator and consequently 
avoids redundant coefficients in the analyzed vector fis. 

Lemma 4: The difference between any two entries of the analyzed vector fis is bounded by 



|u;Js - u].s\ < W2(l - uj.ujj.) ||s|| 2 . (20) 



Proof: From the Cauchy-Schwarz inequality we get 



|cj^.s — a;T.s| = |(^ 5 : —ujj^) T s\ < \\u)i j: — 1 1 2 1 1 s 1 1 2 - (21) 



Since by definition ||u;^ : ||2 = H^jvlb = 1> ^ follows that ||u;^ : — ojj^ = y 2(1 — wj.ujj^). ■ 
The above lemma implies, that if the i th entry of the analyzed vector is significantly larger than then a large 
absolute value of ujj.ujj^ prevents the j th entry to be small. To achieve large cosparsity, this is an unwanted effect 
that our approach avoids via the log-barrier function r in ( [T6| ). It is worth mentioning that the same effect is 
achieved by minimizing the analysis operator's mutual coherence max\uJ.Uj i: \ and that our experiments suggest 

that enlarging \i leads to minimizing the mutual coherence. 

In the next section, we explain how the manifold structure of OB(n, k) can be exploited to efficiently learn the 
analysis operator. 



III. Analysis Operator Learning Algorithm 

Knowing that the feasible set of solutions to Problem ( fTT] ) is restricted to a smooth manifold allows us to 
formulate a geometric conjugate gradient (CG-) method to learn the analysis operator. Geometric CG-methods have 
been proven efficient in various applications, due to the combination of moderate computational complexity and 
good convergence properties, see e.g. (29j for a CG-type method on the oblique manifold. 

To make this work self contained, we start by shortly reviewing the general concepts of optimization on matrix 
manifolds. After that we present the concrete formulas and implementation details for our optimization problem 
on the oblique manifold. For an in-depth introduction on optimization on matrix manifolds, we refer the interested 
reader to [30 1 . 



A. Optimization on Matrix Manifolds 

Let M be a smooth Riemannian submanifold of R nxfc with the standard Frobenius inner product (Q,V) := 
tr(Q T/ P), and let /: R nxk —> R be a differentiable cost function. We consider the problem of finding 

arg min/(AQ. (22) 

The concepts presented in this subsection are visualized in Figure [2] to alleviate the understanding. 

To every point X E M one can assign a tangent space T^M. The tangent space at X is a real vector space 
containing all possible directions that tangentially pass through X. An element E E T^M is called a tangent vector 
at X. Each tangent space is associated with an inner product inherited from the surrounding R nxfc , which allows 
to measure distances and angles on M. 

The Riemannian gradient of / at X is an element of the tangent space T^M that points in the direction of 
steepest ascent of the cost function on the manifold. As we require M to be a submanifold of R nxh and since by 
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y/(#) 



G := u TxM (vf(x)) 




Fig. 2. This figure shows two points X and ^ on a manifold M together with their tangent spaces TxM and TyM. Furthermore, the 
Euclidean gradient V/(A?) and its projection onto the tangent space Ht x m(S f{X)) are depicted. The geodesic T(X, t) in the direction 
of % G TxM connecting the two points is shown. The dashed line typifies the role of a parallel transport of the gradient in TxM to TyM. 



assumption / is defined on the whole R nx/c , the Riemannian gradient G(X) is simply the orthogonal projection of 
the (standard) gradient V f(X) onto the tangent space T#M. In formulas, this reads as 

G(X) :=IIt*m(V/(A0). (23) 

A geodesic is a smooth curve T(X, H, t) emanating from X in the direction of 5 £ T#M, which locally describes 
the shortest path between two points on M. Intuitively, it can be interpreted as the equivalent of a straight line in 
the manifold setting. 

Conventional line search methods search for the next iterate along a straight line. This is generalized to the 
manifold setting as follows. Given a current optimal point XW and a search direction E T^wM at the i th 
iteration, the step size aW which leads to sufficient decrease of / can be determined by finding the minimizer of 

aW = arg mm f(T(X®,H w ,t)). (24) 
t>o 

Once o;W has been determined, the new iterate is computed by 

X&+V = T(X®,H®,a®). (25) 
Now, one straightforward approach to minimize / is to alternate Equations (23 ), (24), and (25 ) using 

= -gift 

with the short hand notation C/W := G(X^), which corresponds to the steepest descent on a Riemannian manifold. 
However, as in standard optimization, steepest descent only has a linear rate of convergence. Therefore, we employ 
a conjugate gradient method on a manifold, as it offers a superlinear rate of convergence, while still being applicable 
to large scale optimization problems with low computational complexity. 

In CG-methods, the updated search direction 7^ +1 ) E T^(»+i)M is a linear combination of the gradient E 
T X (i+i)M and the previous search direction %W E T^wM. Since adding vectors that belong to different tangent 
spaces is not defined, we need to map from T^oM to T^(»+i)M. This is done by the so-called parallel 
transport T(H, AfW, qW), which transports a tangent vector H E T^wM along the geodesic r(A , W,7<W,t) 
to the tangent space T^i+^M. Now, using the shorthand notation 

r| +1) := T&X^\U®,ol®), (26) 
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the new search direction is computed by 

ft( < + 1 > = -0(*+ 1 )+0(O 7 £+ 1 ) ) (27) 

where /?W ER is calculated by some update formula adopted to the manifold setting. Most popular are the update 
formulas by Fletcher-Reeves (FR), Hestenes-Stiefel (HS), and Dai-Yuan (DY). With J^ 1 ) = Q^+ l ) - lf^\ they 
read as 

w _ (gP+D.gE+D) 

(0(0,0(0) ' (2 * } 



V "(T^),^)}- (30) 



Now, a solution to Problem ([22]) is computed by alternating between finding the search direction on M and 
updating the current optimal point until some user- specified convergence criterion is met, or a maximum number 
of iterations has been reached. 



B. Geometric Conjugate Gradient for Analysis Operator Learning 

In this subsection we derive all ingredients to implement the geometric conjugate gradient method as described 
in the previous subsection for the task of learning the analysis operator. Results regarding the geometry of OB(n, k) 
are derived e.g. in p0| . To enhance legibility, and since the dimensions n and k are fixed throughout the rest of 
the paper, the oblique manifold is further on denoted by OB. 

The tangent space at X G OB is given by 

T x OB = {£ G R nxk \ ddiag(A ,T S) = 0}. (31) 

The orthogonal projection of a matrix Q G R nxk onto the tangent space T^OB is 

K Tx ob(Q) = Q - *ddiag(* T Q). (32) 

Regarding geodesies, note that in general a geodesic is the solution of a second order ordinary differential equation, 
meaning that for arbitrary manifolds, its computation as well as computing the parallel transport is not feasible. 
Fortunately, as the oblique manifold is a Riemannian submanifold of a product of k unit spheres S n_1 , the formulas 
for parallel transport and the exponential mapping allow an efficient implementation. 

Let x G 5 n_1 be a point on a sphere and h G T x S rn_1 be a tangent vector at x, then the geodesic in the direction 
of h is a great circle 

f x, *f||h|| 2 = 

7(x,h,t) = ^ x , , sin(i||h|| a ) „ • ( 33 ) 

y xcos(t||h||2) + h — ||h|| 2 ? otherwise. 
The associated parallel transport of a tangent vector £ G T^S 71-1 along the great circle 7(x, h, t) reads as 

£ T h 

|2 
12 

h(l-cos(t||h|| 2 )) ). (34) 



r(£, x, h, t) = £ - ( x||h|| 2 sin(t||h|| 2 ) + 



As OB is a submanifold of the product of unit spheres, the geodesic through X G OB in the direction of 
H G T^OB is simply the combination of the great circles emerging by concatenating each column of X with the 
corresponding column of H, i.e. 



r(x,n,t) 



7(x : ,i,h :j i,i), . . .,7(x :; fc,h :)Jfc ,t) 



(35) 
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Accordingly, the parallel transport of S E T#OB along the geodesic T(X,1-L,t) is given by 

r(s,Ar,H,t) = 

t(£:,i, x :> i, h :> i, t), . . . , r(^ 5fc , x : ^, h : ,jfe, t) . 



(36) 



Now, to use the geometric CG-method for learning the analysis operator, we require a differentiable cost function 
/. Since, the cost function presented in Problem ( [XT] ) is not differentiable due to the non-smoothness of the (p, q)- 
pseudo-norm ( [TO] ), we exchange Function ( fTO] ) with a smooth approximation, which is given by 

x 2 



M 



v ?;=i 



(37) 



with i/ E M + being the smoothing parameter. The smaller v is, the more closely the approximation resembles the 
original function. Again, taking V = VtS and with the shorthand notation Zij := (QS)ij, the gradient of the applied 
sparsity promoting function ([37]) reads as 



M 



3=1 i=l 



k 

£ 

1=1 



The gradient of the rank penalty term ( fT5] ) is 



<9fi v / kn\og(n) 

and the gradient of the logarithmic barrier function ([16]) is 



^yn(in T n) 



&»■(«) = 



E 



l<i<j<k 



Combining Equations ([38]), ( [39] ), and ( |40| ), the gradient of the cost function 



f(X) : = J P>1/ (^ T 5) + « /i(^ T ) + // rfVt 1 



which is used for learning the analysis operator reads as 



(38) 



(39) 



(40) 



(41) 



(42) 



Regarding the CG-update parameter /3^\ we employ a hybridization of the Hestenes-Stiefel Formula (|29f) and 
the Dai Yuan formula 00b 



$j b = max (0, min(/3$, , (43) 

which has been suggested in (3TJ. As explained therein, formula ( [43] ) combines the good numerical performance 
of HS with the desirable global convergence properties of DY. 

Finally, to compute the step size we use an adaption of the well-known backtracking line search to the 
geodesic Y{X^\T-L^\t). In that, an initial step size t$ is iteratively decreased by a constant factor c\ < 1 until 
the Armijo condition is met, see Algorithm [T] for the entire procedure. In our implementation we empirically chose 
c\ = 0.9 and C2 = 10 -2 . As an initial guess for the step size at the first CG-iteration i = 0, we choose 



(o) 



|£ (0) | 



-l 

F ' 



(44) 



as proposed in (32). In the subsequent iterations, the backtracking line search is initialized by the previous step 



size divided by the line search parameter, i.e. £q 



v(i-l) 



Our complete approach for learning the analysis 



operator is summarized in Algorithm [2] Note, that under the conditions that the Fletcher-Reeves update formula is 
used and some mild conditions on the step-size selection, the convergence of Algorithm [2] to a critical point, i.e. 
liminf^oo ||<5^|| = 0, is guaranteed by a result provided in p3|. 
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Algorithm 1 Backtracking Line Search on Oblique Manifold 

Input: 4° > 0, < ci < 1, < C2 < 0.5, A^W^W^W 
Set: t <- 4 i} 

while f{Y{X {i \U {i \t)) > f(X®)+tc 2 (g®,H®) do 

t ^— ci* 
end while 
Output: aW <- t 



Algorithm 2 Geometric Analysis Operator Learning (GOAL) 



Input: Initial analysis operator f2; m - r , training data 5, parameters p, i/, as, // 
Set: i <- 0, <- O^, «(°> <- -G^ 
repeat 

o;W ^— arg min f{Y(X^ 1 U {i \t)), cf. Algorithm □ in conjunction with Equation gT) 

^(i+i) <_ r(AfW,«W,aW), cf. Equation ([35]) 
^(i+i) <_ U Tx(i+1)M (Vf(X^)), cf. Equations ([32]) and ([42} 
/3« <- max (0,min(/3^,/3^)), cf. Equations ([29]), ([30]) 
<_ + cf. Equations ([26]), ([36]) 

i ^— i + 1 

until - < 1CT 4 V i = maximum # iterations 

Output: n* <- 



IV. Analysis Operator based Image Reconstruction 

In this section we explain how the analysis operator f2* E R kxn is utilized for reconstructing an unknown image 
s E from some measurements y E R m following the analysis approach ([5]). Here, the vector s E denotes a 
vectorized image of dimension N = wh, with w being the width and h being the height of the image, respectively, 
obtained by stacking the columns of the image above each other. In the following, we will loosely speak of s as 
the image. 

Remember, that the size of fi* is very small compared to the size of the image, and it has to be applied locally to 
small image patches rather than globally to the entire image. Artifacts that arise from naive patch-wise reconstruction 
are commonly reduced by considering overlapping patches. Thereby, each patch is reconstructed individually and 
the entire image is formed by averaging over the overlapping regions in a final step. However, this method misses 
global support during the reconstruction process, hence, it leads to poor inpainting results and is not applicable for 
e.g. Compressive Sensing tasks. To overcome these drawbacks, we use a method related to the patch based synthesis 
approach from |T2) and the method used in (27j, which provides global support from local information. Instead 
of optimizing over each patch individually and combining them in a final step, we optimize over the entire image 
demanding that a pixel is reconstructed such that the average sparsity of all patches it belongs to is minimized. When 
all possible patch positions are taken into account, this procedure is entirely partitioning-invariant. For legibility, 
we assume square patches i.e. of size (y/n x y/n) with y/n being a positive integer. 

Formally, let r C {1, . . . , h} and c C {1, . . . , w} denote sets of indices with r^+i — Ti = d v , c^+i — q = dh 
and 1 < d Vl dh < y/n. Therein, d Vl dh determine the degree of overlap between two adjacent patches in vertical, 
and horizontal direction, respectively. We consider all image patches whose center is an element of the cartesian 
product set r x c. Hence, with | • | denoting the cardinality of a set, the total number of patches being considered 
is equal to |r||c|. Now, let V rc be a binary (n x N) matrix that extracts the patch centered at position (r, c). With 
this notation, we formulate the (global) sparsity promoting function as 

k 

EEE((^ c s)? + #, (45) 

r£r cGc %—\ 
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which measures the overall approximated ^-pseudo-norm of the considered analyzed image patches. We compactly 
rewrite Equation ([45]) as 

K 

5 (^s):=]T((ft F s)? + l /) ? , (46) 



with K = fe|r||c| and 



n 1 



i=i 



ric 2 



r |r|C|c| 



dKxN 



(47) 



being the global analysis operator that expands the patch based one to the entire image. We treat image boundary 
effects by employing constant padding, i.e. replicating the values at the image boundaries times, where |_-J 

denotes rounding to the smaller integer. Certainly, for image processing applications fi F is too large for being 
applied in terms of matrix vector multiplication. Fortunately, applying Q F and its transposed can be implemented 
efficiently using sliding window techniques, and the matrix vector notation is solely used for legibility. 

According to (34), we exploit the fact that the range of pixel intensities is limited by a lower bound b\ and an 

N 

E Ksi), 
i=i 



upper bound b u . We enforce this bounding constraint by minimizing the differentiable function 6(s) 
where b is a penalty term given as 




if s > b u 
if s < bi 
otherwise 



(48) 



Finally, combining the two constraints ( |46| ) and ( [48] ) with the data fidelity term, the analysis based image 
reconstruction problem is to solve 

s* = arg min^ll^s - y||| + 6(s) + Xg(n F s). (49) 

seR N 

Therein, A E M + balances between the sparsity of the solution's analysis coefficients and the solution's fidelity to 
the measurements. The measurement matrix A E R mx7V and the measurements y E R m are application dependent. 



V. Evaluation and Experiments 

The first part of this section aims at answering the question of what is a good analysis operator for solving image 
reconstruction problems and relates the quality of an analysis operator with its mutual coherence and its condition 
number. This, in turn allows to select the optimal weighting parameters n and ji for GOAL. Using this parameters, 
we learn one general analysis operator fi* by GOAL, and compare its image denoising performance with other 
analysis approaches. In the second part, we employ this fi* unaltered for solving two classical image reconstruction 
tasks of image inpainting and single image super-resolution, and compare our results with the currently best analysis 



approach FoE |27|, and state-of-the-art methods specifically designed for each respective application. 



A. Global Parameters and Image Reconstruction 

To quantify the reconstruction quality, as usual, we use the peak signal-to-noise ratio 
PSNR = 101og(255 2 A^/^ 1 (5 i - s*) 2 ). Moreover, we measure the quality using the Mean Structural SIMilarity 
Index (MSSIM) [35], with the same set of parameters as originally suggested in (35|. Compared to PSNR, the 
MSSIM better reflects a human observer's visual impression of quality. It ranges between zero and one, with one 
meaning perfect image reconstruction. 

Throughout all experiments, we fixed the size of the image patches to (8x8), i.e. n = 64. This is in accordance 
to the patch-sizes mostly used in the literature, and yields a good trade-off between reconstruction quality and 



numerical burden. Images are reconstructed by solving the minimization problem (|49|) via the conjugate gradient 
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method proposed in p4|. Considering the pixel intensity bounds, we used b\ = and b u = 255, which is the 



common intensity range in 8-bit grayscale image formats. The sparsity promoting function (|46|) with p = 0.4 and 
v = 10 -6 is used for both learning the analysis operator by GOAL, and reconstructing the images. Our patch based 



reconstruction algorithm as explained in Section [IV] achieves the best results for the maximum possible overlap 
dh = d v = 1. The Lagrange multiplier A and the measurements matrix A depend on the application, and are briefly 
discussed in the respective subsections. 

B. Analysis Operator Evaluation and Parameter Selection 

For evaluating the quality of an analysis operator and for selecting appropriate parameters for GOAL, we choose 
image denoising as the baseline experiment. The images to be reconstructed have artificially been corrupted by 
additive white Gaussian noise (AWGN) of varying standard deviation a noise . This baseline experiment is further 
used to compare GOAL with other analysis operator learning methods. We like to emphasize that the choice of 
image denoising as a baseline experiment is not crucial neither for selecting the learning parameters, nor for ranking 
the learning approaches. In fact, any other reconstruction task as discussed below leads to the same parameters and 
the same ranking of the different learning algorithms. 

For image denoising, the measurement matrix A in Equation ([49]) is the identity matrix. As it is common in the 
denoising literature, we assume the noise level <J no ise to be known and adjust A accordingly. From our experiments, 
we found that A = ^ftt * s a g°°d choice. We terminate our algorithm after 6 — 30 iterations depending on the noise 
level, i.e. the higher the noise level is the more iterations are required. To find an optimal analysis operator, we 
learned several operators with varying values for /x, ft, and k and fixed all other parameters according to Subsection 



V-A| Then, we evaluated their performance for the baseline task, which consists of denoising the five test images, 
each corrupted with the five noise levels as given in Table [I] As the final performance measure we use the average 
PSNR of the 25 achieved results. The training set consisted of M = 200 000 image patches, each normalized to 
unit Euclidean norm, that have randomly been extracted from the five training images shown in Figure [3] Certainly, 
these images are not considered within any of the performance evaluations. Each time, we initialized GOAL with a 
random matrix having normalized rows. Tests with other initializations like an overcomplete DCT did not influence 
the final operator. 




Fig. 3. Five training images used for learning the analysis operator. 



The results showed, that our approach clearly benefits from over-completeness. The larger we choose k, the 
better the operator performs with saturation starting at k = 2n. Therefore, we fixed the number of atoms for all 
further experiments to k = 2n. Regarding n and ji, note that by Lemma [3] and [4] these parameters influence the 
condition number and the mutual coherence of the learned operator. Towards answering the question of what is 
a good condition number and mutual coherence for an analysis operator, Figure |^[a)| shows the relative denoising 
performance of 400 operators learned by GOAL in relation to their mutual coherence and condition. We like to 
mention that according to our experiments, this relation is mostly independent from the degree of over-completeness. 
It is also interesting to notice that the best learned analysis operator is not a uniformly tight frame. The concrete 
values, which led to the best performing analysis operator f2* E R 128x64 in our experiments are k = 9000 and 
ji = 0.01. Its singular values are shown in Figure |^b)] and its atoms are visualized in Figure [5] This operator f2* 
remains unaltered throughout all following experiments in Subsections |V-C| - |V-E 



C. Comparison with Related Approaches 

The purpose of this subsection is to rank our approach among other analysis operator learning methods, and 
to compare its performance with state-of-the-art denoising algorithms. Concretely, we compare the denoising 
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1.4 1.6 1.8 2 
Condition Number 

(a) 




(b) 



Fig. 4. (a) Performance of 400 analysis operators learned by GOAL in relation to their mutual coherence and their condition number. Color 
ranges from dark blue (worst) to dark red (best). The green dot corresponds to the best performing operator Q*. (b) Singular values of Q* 



Fig. 5. Learned atoms of the analysis operator Q*. Each of the 128 atoms is represented as a 8 x 8 square, where black corresponds to 
the smallest negative entry, gray is a zero entry, and white corresponds to the largest positive entry. 



performance using fi* learned by GOAL with total- variation (TV) |36| which is the currently best known analysis 



operator, with the recently proposed method AOL (25J, and with the currently best performing analysis operator 
FoE [[27). Note that we used the same training set and dimensions for learning the operator by AOL as for GOAL. 
For FoE we used the same setup as originally suggested by the authors. Concerning the required computation 
time for learning an analysis operator, for this setting GOAL needs about 10-minutes on an Intel Core i7 3.2 GHz 
quad-core with 8GB RAM. In contrast, AOL is approximately ten times slower, and FoE is the computationally 
most expensive method requiring several hours. All three methods are implemented in unoptimized Matlab code. 

The achieved results for the five test images and the five noise levels are given in Table [I] Our approach achieves 
the best results among the analysis methods both regarding PSNR, and MSSIM. For a visual assessment, Figure [6] 
exemplarily shows some denoising results achieved by the four analysis operators. 

To judge the analysis methods' denoising performance globally, we additionally give the results achieved by 
current state-of-the-art methods BM3D [37] and K-SVD Denoising [12], which are specifically designed for the 
purpose of image denoising. In most of the cases our method performs slightly better than the K-SVD approach, 
especially for higher noise levels, and besides of the "barabara" image it is at most « 0.5dB worse than BM3D. 
This effect is due to the very special structure of the "barbara" image that rarely occurs in natural images, which 
are smoothed by the learned operator. 
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(c) TV (36), PSNR 29.63dB MSSIM 0.795 (d) FoE [27|, PSNR 29.75dB MSSIM 0.801 



Fig. 6. Images exemplarily showing the typical artifacts created by the four compared analysis operators for image denoising ("man" image, 
o noise — 20). For a better visualization a close up is provided for each image. 

D. Image Inpainting 

In image inpainting as originally proposed in (2), the goal is to fill up a set of damaged or disturbing pixels such 
that the resulting image is visually appealing. This is necessary for the restoration of damaged photographs, for 
removing disturbances caused by e.g. defective hardware, or for deleting unwanted objects. Typically, the positions 
of the pixels to be filled up are given a priori. In our formulation, when N — m pixels must be inpainted, this 
leads to a binary mx N dimensional measurements matrix A, where each row contains exactly one entry equal to 
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TABLE I 

Achieved PSNR in decibels (dB) and MSSIM for denoising five test images corrupted by five noise levels. Each cell 

CONTAINS THE ACHIEVED RESULTS FOR THE RESPECTIVE IMAGE WITH SIX DIFFERENT ALGORITHMS, WHICH ARE: TOP LEFT GOAL, 
TOP RIGHT AOL (26), MIDDLE LEFT TV (36), MIDDLE RIGHT FOE (27), BOTTOM LEFT K-SVD DENOISING (12) BOTTOM RIGHT 

BM3D [37] 



Onoise 1 PSNR 


lena 


barbara 


man 


boat 


couple 




IVloollvl 




lVlOOllVl 




IVloollvl 




IVloollvl 




IVlooliVl 


5 / 34.15 


38.65 36.51 
37.65 38.19 
38.48 38.45 


0.945 0.924 
0.936 0.938 
0.944 0.942 


37.96 35.95 
35.56 37.25 
38.12 38.27 


0.962 0.944 
0.948 0.958 
0.964 0.964 


37.77 35.91 
36.79 37.45 
37.51 37.79 


0.954 0.932 
0.944 0.949 
0.952 0.954 


37.09 35.77 
36.17 36.33 
37.14 37.25 


0.938 0.926 
0.925 0.917 
0.939 0.938 


37.43 35.55 
36.26 37.06 
37.24 37.14 


0.951 0.932 
0.940 0.944 
0.950 0.951 


10 / 28.13 


35.58 32.20 
34.24 35.12 
35.52 35.79 


0.910 0.856 
0.890 0.901 
0.910 0.915 


33.98 31.27 
30.84 32.91 
34.56 34.96 


0.930 0.883 
0.886 0.923 
0.936 0.942 


33.88 31.33 
32.90 33.44 
33.64 33.97 


0.907 0.851 
0.884 0.893 
0.901 0.907 


33.72 31.24 
32.54 33.23 
33.68 33.91 


0.883 0.842 
0.863 0.868 
0.883 0.887 


33.75 30.87 
32.32 33.37 
33.62 33.86 


0.903 0.844 
0.878 0.889 
0.901 0.909 


20 / 22.11 


32.63 28.50 
31.09 31.97 
32.39 32.98 


0.869 0.772 
0.827 0.856 
0.861 0.875 


30.17 27.26 
26.79 28.39 
30.87 31.78 


0.880 0.791 
0.773 0.849 
0.881 0.905 


30.44 27.33 
29.63 29.75 
30.17 30.59 


0.831 0.720 
0.795 0.801 
0.814 0.833 


30.62 27.16 
29.30 29.96 
30.44 30.89 


0.819 0.711 
0.778 0.793 
0.805 0.825 


30.39 26.95 
28.87 29.77 
30.08 30.68 


0.833 0.727 
0.783 0.807 
0.817 0.847 


25 / 20.17 


31.65 27.47 
30.05 30.87 
31.33 32.02 


0.854 0.742 
0.796 0.836 
0.842 0.859 


29.05 26.08 
25.73 27.05 
29.59 30.72 


0.856 0.750 
0.724 0.813 
0.850 0.887 


29.43 26.28 
28.66 28.62 
29.14 29.62 


0.801 0.677 
0.759 0.761 
0.780 0.804 


29.61 26.08 
28.32 28.87 
29.36 29.92 


0.792 0.671 
0.744 0.758 
0.772 0.801 


29.32 25.81 
27.87 28.57 
28.92 29.65 


0.802 0.679 
0.746 0.767 
0.780 0.820 


30 / 18.59 


30.86 26.50 
29.40 30.00 
30.44 31.22 


0.839 0.717 
0.786 0.823 
0.823 0.843 


27.93 24.95 
24.91 25.97 
28.56 29.82 


0.818 0.706 
0.690 0.787 
0.821 0.868 


28.64 25.30 
27.95 27.85 
28.30 28.87 


0.774 0.638 
0.736 0.740 
0.750 0.780 


28.80 25.07 
27.56 28.01 
28.48 29.13 


0.769 0.630 
0.720 0.737 
0.744 0.779 


28.46 24.79 
27.09 27.70 
27.95 28.81 


0.780 0.633 
0.715 0.743 
0.746 0.795 



TABLE II 

Results achieved for inpainting three test images with varying number of missing pixels using three different 
methods. in each cell, the psnr in db and the mssim are given for goal (top), foe [27 ] (middle), and method (38) 

(BOTTOM). 



% of missing pixels 


lena 


boat 


man 


PSNR 


MSSIM 


PSNR 


MSSIM 


PSNR 


MSSIM 


0.90% 


28.57 


0.840 


25.61 


0.743 


26.35 


0.755 


28.06 


0.822 


25.14 


0.719 


26.23 


0.747 


27.63 


0.804 


24.80 


0.683 


25.56 


0.715 


0.80% 


31.82 


0.895 


28.55 


0.833 


28.93 


0.847 


31.09 


0.880 


27.76 


0.804 


28.51 


0.836 


30.95 


0.878 


27.80 


0.804 


28.24 


0.821 


0.50% 


37.75 


0.956 


34.47 


0.936 


34.12 


0.947 


36.70 


0.947 


33.17 


0.907 


33.49 


0.940 


36.75 


0.943 


33.77 


0.918 


33.27 


0.934 


0.20% 


43.53 


0.985 


41.04 


0.982 


40.15 


0.985 


42.29 


0.981 


38.45 


0.963 


39.15 


0.982 


40.77 


0.965 


39.45 


0.966 


39.06 


0.977 



one. Its position corresponds to a pixel with known intensity. Hence, A reflects the available image information. 
Regarding A, it can be used in a way that our method simultaneously inpaints missing pixels and denoises the 
remaining ones. 

As an example for image inpainting, we disturbed some ground-truth images artificially by removing N — m 
pixels randomly distributed over the entire image as exemplary shown in Figure (^a)] In that way, the reconstruction 
quality can be judged both visually and quantitatively. We assumed the data to be free of noise, and empirically 
selected A = 10 -2 . In Figure [TJ we show exemplary results for reconstructing the "lena" image from 10% of all 
pixels using GOAL, FoE, and the recently proposed synthesis based method |38|. Table [TT] gives a comparison of 
further images and further number of missing pixels. It can be seen that our methods performs best independent 
of the configuration. 



E. Single Image Super-Resolution 

In single image super-resolution (SR), the goal is to reconstruct a high resolution image s E R N from an observed 
low resolution image y E M m . In that, y is assumed to be a blurred and downsampled version of s. Mathematically, 
this process can be formulated as y = VBs + e where V E R mx7V is a decimation operator and B E R 7Vx7V is 
a blur operator. Hence, the measurement matrix is given by A = VB. In the ideal case, the exact blur kernel is 
known or an estimate is given. Here, we consider the more realistic case of an unknown blur kernel. Therefore, to 
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(a) Masked 90% missing pixels. 



(b) Inpainted image GOAL, PSNR 28.57dB MSSIM 0.840. 




(c) Inpainted image FoE, PSNR 28.06dB MSSIM 0.822. (d) Inpainted image (38J, PSNR 27.63dB MSSIM 0.804. 

Fig. 7. Results for reconstructing the "lena" image from 10% of all pixels using Q* learny by GOAL, FoE, and [38]. 



apply our approach for magnifying an image by a factor of d in both vertical and horizontal dimension, we model 
the blur via a Gaussian kernel of dimension (2d — 1) x (2d — 1) and with standard deviation ouur — f • 

For our experiments, we artificially created a low resolution image by downsampling a ground-truth image by a 
factor of d using bicubic interpolation. Then, we employed bicubic interpolation, FoE, the method from (39j, and 
GOAL to magnify this low resolution image by the same factor d. This upsampled version is then compared with 



the original image in terms of PSNR and MSSIM. In Table [TTTj, we present the results for upsampling the respective 
images by d = 3. The presented results show that our method outperforms the current state-of-the-art. We want 
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TABLE III 

The results in terms of PSNR and MSSIM for upsampling the seven test images by a factor of d = 3 using five 

DIFFERENT ALGORITHMS GOAL, FoE (27), METHOD (39), AND BICUBIC INTERPOLATION. 



Method 


face 


august 


barbara 


lena 


man 


boat 


couple 


PSNR | MSSIM 


PSNR | MSSIM 


PSNR | MSSIM 


PSNR | MSSIM 


PSNR | MSSIM 


PSNR | MSSIM 


PSNR | MSSIM 


GOAL 


32.37 | 0.801 


23.28 | 0.791 


24.42 | 0.731 


32.36 | 0.889 


29.48 | 0.837 


28.25 | 0.800 


27.79 | 0.786 


FoE 


32.19 | 0.797 


22.95 | 0.782 


24.30 | 0.727 


31.82 | 0.885 


29.17 | 0.832 


28.00 | 0.797 


27.64 | 0.782 


Method | 39] 


32.16 | 0.795 


22.90 | 0.771 


24.25 | 0.719 


32.00 | 0.881 


29.29 | 0.829 


28.04 | 0.793 


27.56 | 0.778 


Bicubic 


31.57 | 0.771 


22.07 | 0.724 


24.13 | 0.703 


30.81 | 0.863 


28.39 | 0.796 


27.18 | 0.759 


26.92 | 0.743 



to emphasize that the blur kernel used for downsampling is different from the blur kernel used in our upsampling 
procedure. 

Note that many single image super-resolution algorithms rely on clean noise free input data, whereas the general 
analysis approach as formulated in Equation ([49]) naturally handles noisy data, and is able to perform simultaneous 
upsampling and denoising. In Figure [8] we present the result for simultaneously denoising and upsampling a low 
resolution version of the image "august" by a factor of d = 3, which has been corrupted by AWGN with a noise = 8. 
As it can be seen, our method produces the best results both visually and quantitatively, especially regarding the 
MSSIM. Due to high texture this image is hard to upscale even when no noise is present, see the second column 
of Table [Hi] Results obtained for other images confirm this good performance of GOAL but are not presented here 
due to space limitation. 



VI. Conclusion 

This paper deals with the topic of learning an analysis operator from example image patches, and how to apply 
it for solving inverse problems in imaging. To learn the operator, we motivate an ^-minimization on the set of 
full-rank matrices with normalized columns. A geometric conjugate gradient method on the oblique manifold is 
suggested to solve the arising optimization task. Furthermore, we give a partitioning invariant method for employing 
the local patch based analysis operator such that globally consistent reconstruction results are achieved. For the 
famous tasks of image denoising, image inpainting, and single image super-resolution, we provide promising results 
that are competitive with and even outperform current state-of-the-art techniques. Similar as for the synthesis signal 
reconstruction model with dictionaries, we expect that depending on the application at hand, the performance of the 
analysis approach can be further increased by learning the particular operator with regard to the specific problem, 
or employing a specialized training set. 
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